Gradient Descent

No Mercy for High Loss - Only Descent

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.

Gradient Descent (GD) is an optimization algorithm used to minimize the cost or loss function in machine learning and deep learning. It iteratively adjusts parameters (weights) to reduce the error between predicted and actual outcomes. The idea is to move towards the minimum value of the loss function by following the negative gradient. The cost function represents how well a model fits the data. Gradient descent seeks to minimize this cost.

Gradient descent is a way to minimize an objective function \(J(\theta)\), parameterized by a model’s parameters \((\theta \in \mathbb{R}^d )\), by updating the parameters in the opposite direction of the gradient of the objective function, \(\nabla J(\theta)\), with respect to the parameters \(\theta\).

The learning rate, \(\eta\), determines the size of the steps we take to reach a (local) minimum.

Why Gradient Descent?
It works well for optimizing high-dimensional problems, such as neural networks.

#1 Steps Involved in Gradient Descent

1. Initialize the Parameters

  • Begin by initializing the model parameters \(\theta\) with some values (usually small random numbers).
  • These parameters will be updated iteratively to minimize the objective function.

2. Calculate the Gradient

  • Compute the gradient of the objective function \(J(\theta)\) with respect to the parameters \(\theta\). This is the derivative \(\nabla J(\theta)\), which indicates the direction of the steepest ascent.

3. Update the Parameters

  • Update the parameters \(\theta\) in the direction opposite to the gradient to minimize the objective function: \[ \boldsymbol{\theta_{new} = \theta_{old} - \eta \cdot \frac{d}{d\theta}(J(\theta))} \] where \(\eta\) is the learning rate, controlling the step size.

4. Repeat Until Convergence

  • Continue iterating through steps 2 and 3 until the algorithm converges, i.e., the changes in the parameters become very small or the objective function reaches a desired minimum, e.g., \[|\theta_{old} - \theta_{new}| < 0.002\]

5. (Optional) Check for Convergence Criteria

  • Convergence can be determined by one or more of the following:
    • The gradient \(\nabla J(\theta)\) becomes sufficiently close to zero.
    • The change in the value of \(J(\theta)\) between iterations falls below a predefined threshold.
    • A fixed number of iterations is reached.

6. Return the Final Parameters

  • Once the parameters have converged, return the final optimized values of \(\theta\).
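The six steps above can be sketched in a few lines of NumPy. The one-parameter quadratic objective \(J(\theta) = (\theta - 3)^2\) and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

# Illustrative objective: J(theta) = (theta - 3)^2, minimized at theta = 3
def J(theta):
    return (theta - 3) ** 2

def grad_J(theta):
    return 2 * (theta - 3)

theta = np.random.randn()   # Step 1: initialize the parameter
eta = 0.1                   # learning rate
tol = 1e-8                  # convergence threshold

for _ in range(1000):                 # Step 4: repeat until convergence
    g = grad_J(theta)                 # Step 2: calculate the gradient
    theta_new = theta - eta * g       # Step 3: update the parameter
    if abs(theta_new - theta) < tol:  # Step 5: convergence criterion
        theta = theta_new
        break
    theta = theta_new

print(theta)  # Step 6: return the final parameter, close to 3
```

Because this objective is convex, the loop converges to the global minimum for any starting point.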

Gradient Descent Algorithm

Gradient Descent Pseudo-code

  1. Initialize parameters \(\theta\) (e.g., \(\theta_0, \theta_1, ..., \theta_n\)) randomly or with zeros
  2. Choose learning rate \(\eta\) (a small positive number)
  3. Set the number of iterations (max_iters) or define a convergence criterion

Repeat until convergence or for max_iters times:

  1. Calculate the gradient of the cost function \(J(\theta)\) with respect to each parameter \(\theta\)
    • For each parameter \(\theta_i\): \[\text{gradient}_i = \frac{\partial}{\partial \theta_i}J(\theta)\] (the partial derivative of the cost function with respect to \(\theta_i\)); stacked together, these form the gradient vector \[\nabla J(\theta) = \left[\frac{\partial}{\partial \theta_1}J(\theta), \frac{\partial}{\partial \theta_2}J(\theta), ... , \frac{\partial}{\partial \theta_n}J(\theta)\right]\]
  2. Update the parameters \(\theta\) using the gradient and learning rate \(\eta\):
    • For each parameter \(\theta_i\): \[\theta_i = \theta_i - \eta \cdot \text{gradient}_i\]
  3. Optionally: Check if the change in the cost function or parameters is small enough (convergence criterion)

Return the optimized parameters \(\theta\)

Taking an example

Say the objective function is \(J(\theta)\) with parameters \(\theta = (m, c)\); then gradient descent becomes

repeat until converge {
m[i] = m[i-1] - (learning rate * slope_m)
c[i] = c[i-1] - (learning rate * slope_c)
}

where, slope_m = \(\large \frac{\partial J}{\partial m}\) and slope_c= \(\large \frac{\partial J}{\partial c}\)

  1. If Slope is positive:
    m[i] = m[i-1] - [(learning rate) * (+ slope_m)]

  2. If Slope is negative:
    m[i] = m[i-1] - [(learning rate) * (- slope_m)]
    m[i] = m[i-1] + [(learning rate) * (slope_m)]

The model is said to have converged when the change in the parameters becomes negligible, i.e., |m[i] - m[i-1]| falls below a small threshold (e.g., 0.01).
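The m and c updates above can be run end-to-end for simple linear regression with an MSE cost; the toy dataset (noise-free points on y = 2x + 1) and hyperparameters are assumptions for illustration:

```python
import numpy as np

# Assumed toy dataset: points on the line y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * X + 1

m, c = 0.0, 0.0   # initial slope and intercept
eta = 0.05        # learning rate
n = len(X)

for _ in range(5000):  # repeat until (approximately) converged
    y_hat = m * X + c
    # Gradients of the MSE cost J(m, c) = (1/n) * sum((y - y_hat)^2)
    slope_m = (-2 / n) * np.sum((y - y_hat) * X)   # dJ/dm
    slope_c = (-2 / n) * np.sum(y - y_hat)         # dJ/dc
    m = m - eta * slope_m
    c = c - eta * slope_c

print(round(m, 3), round(c, 3))  # approaches m = 2, c = 1
```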

Finding Minimum from any point

#2 Learning Rate η

The learning rate is a hyperparameter in machine learning that controls the step size at which the parameters \(\theta\) of an objective function \(J(\theta)\) are updated during training.
Choosing a suitable learning rate matters because both low and high learning rates can waste time and resources.
- Too large: The algorithm might overshoot the minimum, causing divergence.
- Too small: The convergence will be slow, and it might get stuck in local minima.

A desirable learning rate is one that’s low enough so that the model/network converges to something useful but high enough so that it can be trained in a reasonable amount of time.
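These regimes can be illustrated numerically; the toy objective \(f(x) = x^2\) (gradient \(2x\)) and the three learning rates are assumptions chosen to show each behavior:

```python
def minimize(eta, steps=50, x0=5.0):
    """Run gradient descent on f(x) = x^2 (gradient 2x) and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x
    return x

print(minimize(0.01))   # too small: still far from the minimum at 0 after 50 steps
print(minimize(0.4))    # reasonable: converges very close to 0
print(minimize(1.1))    # too large: |x| grows each step -- divergence
```

Each update multiplies x by (1 - 2η), so the iteration converges only when that factor has magnitude below 1, i.e., 0 < η < 1.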

Learning Rate effect

#3 Some Concepts

Before understanding Gradient descent, we need to learn some things:

1. Loss function

This represents the error for a single training example. It measures how far off a model’s prediction is from the actual value for a single data point. For example, in regression, the loss for a single example could be the squared error: \[\text{Loss} = (y_i - \hat{y_i})^2\]

2. Cost function

The cost function is typically the average of the loss over the entire training set, giving an overall measure of model error across all data points. For example, with Mean Squared Error (MSE), the cost function over \(m\) samples is: \[J(W) = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y_i})^2\]
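The loss/cost distinction can be made concrete in a few lines of NumPy (the toy values are assumptions):

```python
import numpy as np

y     = np.array([3.0, -0.5, 2.0, 7.0])   # actual values (assumed toy data)
y_hat = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

per_example_loss = (y - y_hat) ** 2       # loss: one value per training example
cost = per_example_loss.mean()            # cost: average loss over all m samples

print(per_example_loss)  # squared error for each sample
print(cost)              # 0.375
```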

3. Convex function

A convex function is a type of function in mathematics and optimization that has a shape where any line segment drawn between two points on the function’s curve lies above or on the curve itself. Convex functions have specific properties that make them particularly important in optimization because they are easier to work with, especially when it comes to finding global minima.

For a convex function:

  • Upward Curvature: The function curves upward or is a straight line.
  • Single Global Minimum: There is either a unique global minimum (the lowest point on the curve) or the function is flat (in which case every point is a minimum).

Example:
\(f(x) = x^2\)
\(f(x) = e^x\)
\(f(x) = ax+b\) etc.

Code
import matplotlib.pyplot as plt
import numpy as np
ipt = np.linspace(-10, 10, 100)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(5,3))
plt.plot(ipt, ipt**2, label="f(x)")  # convex curve f(x) = x^2
plt.plot(ipt, 0.5*ipt + 70, color='black', linewidth=1)  # reference line illustrating a chord above the curve
plt.title("Convex function", color='brown')
plt.axis('off')
plt.legend()
plt.show()

4. Non-Convex function

A non-convex function has multiple peaks and valleys, resulting in multiple local minima and local maxima. This makes finding the global minimum (the lowest point) challenging because optimization algorithms like Gradient Descent can easily get stuck in one of these local minima rather than finding the lowest possible value.

The loss functions used in deep learning (e.g., mean squared error or cross-entropy) are generally non-convex due to the large number of parameters and the complex network structure, leading to highly irregular optimization landscapes.

For a non-convex function:

  • Multiple Local Minima and Maxima: non-convex functions often have several local minima and maxima.
  • Complex landscape: Non-convex functions often have irregular, oscillating shapes. This creates a “rough” landscape with many ups and downs.
  • Curved in Different Directions: Non-convex functions may have different curvatures in different directions, which can create valleys, peaks, and saddle points.

Examples:
\(f(x) = x^4 - x^3\)
\(f(x) = \text{sin}(x)\) etc.

Code
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')
ipt = np.linspace(0, 600, 100) * np.pi / 180  # angles from 0 to 600 degrees, in radians
plt.figure(figsize=(5,3))
plt.plot(ipt, np.sin(ipt) + 0.08*ipt, label='f(x)')  # oscillating, non-convex curve
plt.plot(ipt, 0.05*ipt + 0.44, color='black', linewidth=1)
plt.axis('off')
plt.title("Non Convex function", color='brown')
plt.legend(loc='lower left')
plt.show()

| Feature | Convex Function | Non-Convex Function |
| --- | --- | --- |
| Shape | Curves upward, or is flat/linear | Complex, with multiple peaks and valleys |
| Global Minimum | Has a unique global minimum or is flat | Can have multiple local minima and a global minimum |
| Local Minima | No local minima other than the global minimum | May have multiple local minima, leading to optimization challenges |
| Optimization | Easier to optimize; Gradient Descent is guaranteed to find the global minimum | Difficult to optimize; Gradient Descent can get stuck in local minima |
| Examples | Quadratic (\(f(x) = x^2\)), exponential (\(f(x) = e^x\)), linear | Trigonometric (\(f(x) = \sin(x)\)), polynomial (\(f(x) = x^4 - x^2\)) |
| Second Derivative | Positive or zero (\(f''(x) \ge 0\) for all \(x\)) | Can be positive, zero, or negative; varies across the domain |
| Gradient Descent | Converges to the global minimum reliably | May converge to a local minimum or saddle point; not guaranteed to find the global minimum |
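The second-derivative row of the comparison can be checked numerically. This is a sketch using a central-difference approximation of \(f''(x)\); the sample grid and tolerance are assumptions chosen for illustration:

```python
import numpy as np

def is_convex_on(f, xs, h=1e-4):
    """Numerically test f''(x) >= 0 at each sample point via central differences."""
    second = (f(xs + h) - 2 * f(xs) + f(xs - h)) / h**2
    return bool(np.all(second >= -1e-6))  # small tolerance for rounding error

xs = np.linspace(-2, 2, 401)
print(is_convex_on(lambda x: x**2, xs))         # True: quadratic is convex
print(is_convex_on(np.exp, xs))                 # True: exponential is convex
print(is_convex_on(lambda x: x**4 - x**2, xs))  # False: f'' < 0 near x = 0
```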

#4 Optimization Challenges in Deep Learning

Optimization in the case of convex functions is a relatively straightforward task. In machine learning, for example, the loss function for linear regression is the mean squared error (MSE), which is always convex. This means that there is only a single optimal minimum which is the global minimum in the entire MSE function, and we do not need to worry about the existence of local optima.

However, in the case of neural networks, we have many parameters (weights and biases) to train in order to get the optimal solution and minimize the loss. This is because neural networks are composed of many non-linear functions that are stacked (layer after layer) on top of each other. The composition of non-linear functions can produce a non-convex loss function. Non-convex functions have many local minima, out of which only one is the global minimum. Our goal is to reach the global minimum by avoiding these local minima.

Local Minima, Global Minima and Saddle Point

1. The Local Minima

Non-convex functions often have multiple local minima, where the loss function reaches a low point but not the global minimum.

Gradient Descent and its variants can get stuck in these local minima, causing the network to converge to suboptimal solutions. This means the model may not perform as well as it could if it found the global minimum.

Techniques like Stochastic Gradient Descent (SGD), momentum, and adaptive optimizers (e.g., Adam) help by introducing some variability in the optimization path, making it possible to escape shallow local minima.

Local vs Global minima

2. The Initial Guess problem

Choosing the right hyperparameters is crucial for optimization in neural networks, and the learning rate is one of the most important. The learning rate controls the step size in each update during gradient descent. If we set it too low, the optimization process will be slow, requiring many updates to approach the minimum. Conversely, if we set the learning rate too high, the updates may overshoot the minimum, leading to oscillations or even divergence, preventing convergence.

The learning rate also plays a role in avoiding local minima. In a non-convex loss landscape, local minima are points where the loss is lower than the nearby region but not the lowest possible value (the global minimum). Local minima typically have low gradients around them, making it easy for the optimizer to settle there. By setting a larger learning rate, the optimizer can “jump” over these shallow valleys, avoiding some local minima and continuing its search for a better solution. However, a high learning rate can also cause the optimizer to miss deeper minima (like the global minimum) by overshooting them.

Code
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('default')

# Define a non-convex function with a local and global minima
def f(x):
    return 3*x**4 - 5*x**2 + x

def g(x):
    return 12*x**3 - 10*x + 1

# Parameters
learning_rates = [0.07, 0.05]  # Different learning rates to demonstrate overshooting
initial_position = 1.4  # Start point near the local minimum
iterations = 40  # Number of iterations for the gradient descent
max_position_value = 10  # Maximum value for position clipping

# Prepare the plot
fig, axes = plt.subplots( 2,1, figsize=(8,10))

# Loop through each learning rate and plot in a separate subplot
for i, learning_rate in enumerate(learning_rates):
    # Initialize the ball position
    position = initial_position
    trace_x = [position]
    trace_y = [f(position)]

    # Perform gradient descent with the current learning rate
    for _ in range(iterations):
        gradient = g(position)
        position = position - learning_rate * gradient
        
        # Clip the position to avoid it growing too large (overflow safeguard)
        if abs(position) > max_position_value:
            position = max_position_value * np.sign(position)
        
        trace_x.append(position)
        trace_y.append(f(position))

    # Plotting the function
    x = np.linspace(-1.5, 1.5, 400)
    y = f(x)

    ax = axes[i]
    ax.plot(x, y, label=r'$f(x) = 3x^4 - 5x^2 + x$', color='b')  # Function plot

    # Mark the local and global minima
    local_min_x = 0.85
    global_min_x = -0.95
    ax.scatter([global_min_x, local_min_x], [f(global_min_x), f(local_min_x)], color='red', zorder=5)

    # Mark the minima points with legends
    ax.text(global_min_x, f(global_min_x) - 0.5, 'Global Minimum', color='red', fontsize=12, ha='center')
    ax.text(local_min_x, f(local_min_x) - 0.5, 'Local Minimum', color='red', fontsize=12, ha='center')

    # Plot the ball trace
    ax.plot(trace_x, trace_y, 'go-', label='Ball Trace (Overshooting)' if learning_rate == 0.07 else 'Ball Trace (No Overshooting)')

    # Title and labels
    ax.set_title(f'Demonstration with Learning Rate = {learning_rate}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.axhline(0, color='black', linewidth=1)
    ax.axvline(0, color='black', linewidth=1)
    ax.set_ylim(-4, 4)
    ax.legend()
    ax.grid(True)

plt.suptitle("Comparison of Gradient Descent with Different Learning Rates", fontsize=16)
plt.tight_layout()
plt.show()

3. Saddle Points and Plateaus

In high-dimensional spaces (such as those in deep neural networks), saddle points, also called minimax points, are very common. A saddle point is a point on the surface of a function where the slope is zero (as at a minimum or maximum), but it is neither a true minimum nor a true maximum.

Imagine you’re standing on a mountain pass: a point between two peaks that is the lowest point along a ridge, but not the lowest point in all directions. In one direction, the ground slopes up toward the peaks on either side (so it’s a low point in that direction). But if you look the other way, it slopes down into valleys (making it a high point in that direction).

In the context of optimization (like gradient descent in machine learning), a saddle point is tricky because the slope is zero, so it may “fool” the algorithm into thinking it has reached a minimum or maximum. However, it’s just a flat spot that doesn’t represent the lowest or highest possible point overall.

  • It is called a minimax point because from one direction it is a local minimum, while from another direction it is a local maximum.
  • It is called a saddle point because its shape resembles a saddle on a horse.
Code
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(6,4))
plt.style.use('fivethirtyeight')
x = np.arange(-2.0, 2.0, 0.01)
plt.plot(x, x**3)  # f(x) = x^3 has a saddle point at the origin
plt.scatter(0, 0, color='red', zorder=5)
plt.annotate('Saddle Point', xy=(0, 0),
             xytext=(0.5, -6),
             arrowprops=dict(facecolor='green', shrink=0.05))
plt.axis('off')
plt.show()

A saddle point (in red) on the graph of \(z = x^2 − y^2\)

Saddle point between two hills

Non-convex functions can have flat regions or plateaus where the gradient is close to zero over a range of values. A plateau is a flat, level area in a landscape. Imagine a long stretch of land where there’s almost no change in elevation; it’s just flat and goes on for a while without any uphill or downhill movement. In the context of a loss landscape (the landscape of values that show how good or bad a model’s predictions are), a plateau is an area where the loss value doesn’t change much as you move across it. When a gradient descent algorithm (the “walker”) enters a plateau, it struggles to find a clear direction to go in because the “slope” is very shallow or zero, making progress slow or causing it to get “stuck” temporarily.

Plateau in region
  • Saddle Point: A single point with both upward and downward slopes in different directions.

  • Plateau: A wide, flat area with no steep slope in any direction, causing slow movement through it.

4. Sensitivity to Initial Parameters

Gradient Descent’s convergence can be highly dependent on the initial values of the parameters. Poor initialization can lead to slower convergence or getting stuck in poor local minima (for non-convex functions).
In deep learning, poor initialization can result in dead neurons (especially in ReLU networks) and significantly slow down training.
Use weight initialization techniques like Xavier or He initialization, which take into account the layer size and the type of activation function. Randomized initialization is also better than starting with zeros, as it breaks symmetry.
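A minimal sketch of the two initializers mentioned; the layer sizes are assumptions, and both schemes here draw from zero-mean Gaussians whose variance is set by the fan sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: variance 2 / (fan_in + fan_out), suited to tanh/sigmoid layers."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    """He: variance 2 / fan_in, suited to ReLU layers (about half the units output zero)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W1 = he_init(784, 256)  # e.g., first layer of a ReLU network (sizes assumed)
print(W1.std())         # close to sqrt(2/784), roughly 0.05
```

Drawing random values (rather than zeros) also breaks symmetry, so different units in a layer learn different features.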

5. Scaling and Feature Normalization

Gradient Descent is sensitive to the scale of the features. When features have widely different scales, the algorithm might struggle with convergence.
Unscaled features can lead to erratic updates, slow convergence, and suboptimal results.
Use feature scaling (such as standardization or normalization) to ensure that features are on a similar scale.
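A minimal standardization sketch on an assumed toy feature matrix (the same idea as scikit-learn's StandardScaler, used later in this document):

```python
import numpy as np

# Assumed toy features: age (tens) and salary (tens of thousands)
X = np.array([[19.0, 19000.0],
              [35.0, 20000.0],
              [26.0, 43000.0],
              [27.0, 57000.0]])

# Standardization: zero mean, unit variance per column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.std(axis=0))         # wildly different scales before scaling
print(X_scaled.std(axis=0))  # both 1 afterwards -- comparable gradient steps per feature
```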


#5 Types of Gradient Descent

There are three variants of Gradient Descent, which differ in how much data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the accuracy of parameter update and the time it takes to perform an update.

1. Batch Gradient Descent (Vanilla)

Batch Gradient Descent is an optimization algorithm used to minimize the cost function of a model by updating its parameters in the direction that reduces the overall error.

The gradient is calculated using the entire dataset at each iteration. The term “batch” indicates that we use the full batch of data points in each update. This makes the updates more stable but can be computationally expensive if the dataset is very large.

Consider this Pseudo code:

Set learning rate η
Set number of epochs
for epoch in range(epochs):
    # Step 1: Forward pass, i.e., predict y_hat for all data
    y_hat = np.dot(X, W)

    # Step 2: Calculate cost function (e.g., Mean Squared Error)
    cost = (1 / m) * sum((y - y_hat) ** 2)  # m : no. of samples

    # Step 3: Calculate gradient
    gradient = dJ/dW

    # Step 4: Update weights
    W = W - η * gradient

    # Optionally print or store the cost for monitoring
    print("Epoch:", epoch, "Cost:", cost)

Return final weights W

Explanation

1. Initialize weights: Start with random weights \(W\)
2. Learning rate and Epochs: Define the learning rate \(\eta\) and the number of epochs.
3. Loop through epochs:
   3.1 Calculate Predictions: Compute predictions \(\hat{y}\) for each data point \(x_i\) using the dot product:
   y_hat = np.dot(X, W)
   3.2 Compute Cost Function: Use the appropriate cost function \(J(W)\). This provides a measure of the model's error, with the current weights, over the entire dataset.
   3.3 Calculate Gradient \(\boldsymbol{\nabla_WJ(W)}\): The gradient is the derivative of the cost function w.r.t. each weight, \(\large \frac{\partial J(W)}{\partial w}\), which we use to update the weights \(W\).

4. Update Weights: Update the weights \(W\) by moving in the opposite direction of the gradient:\[{W = W - \eta \cdot \nabla_{W} J(W)}\]

Repeat steps 3.1 - 3.3 and step 4 for a specified number of epochs or until the cost function converges (i.e., does not decrease significantly with further updates).

Number of times weight update = number of epochs.
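The batch pseudocode above can be run end-to-end on a small synthetic regression problem; the dataset, shapes, and hyperparameters here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.hstack([rng.uniform(-1, 1, (m, 1)), np.ones((m, 1))])  # feature + bias column
true_W = np.array([3.0, 2.0])
y = X @ true_W + rng.normal(0, 0.1, m)  # noisy linear targets (assumed)

W = np.zeros(2)  # initialize weights
eta = 0.1
epochs = 500

for epoch in range(epochs):
    y_hat = X @ W                       # forward pass on the full batch
    cost = np.mean((y - y_hat) ** 2)    # MSE cost over all m samples (for monitoring)
    grad = (2 / m) * X.T @ (y_hat - y)  # gradient of the cost w.r.t. W
    W = W - eta * grad                  # exactly one update per epoch

print(np.round(W, 2))  # close to [3. 2.]
```

Note that the weights are updated once per epoch, matching the "number of updates = number of epochs" rule above.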

Gradient Descent Convergence

Contour plot of Convex function with Batch GD reaching on Global minimum smoothly
Advantages
  1. Steady and Accurate Convergence to a Minimum:
    Batch Gradient Descent will normally yield a direct path to a minimum. This means little time will be spent moving in incorrect directions in weight space if the learning rate \(\eta\) is not too big. This also means Batch Gradient Descent will yield a very accurate estimate of the minimum as shown above.

  2. Easy Implementation:
    The Batch Gradient Descent algorithm is quite simple to implement in code, when compared to other variations. The number of hyperparameters introduced is kept to a minimum, and the implementation can be fully vectorised.

  3. Excellent for Convex Functions:
    Batch Gradient Descent works very well for convex functions with smaller-sized datasets. In such a scenario, all derivatives point to the single global minimum, and there will be no issue loading all the training data into memory for processing.

Limitations
  1. Impractical for Larger Datasets:
    As all the training data is used to compute each step of the algorithm, this approach is very computationally expensive. All the training data must be loaded into memory. For Big Data, this simply isn't feasible.

  2. Slow to Converge:
    Related to the first point; since all the training data is used to compute each step, this puts strain on the processing resources we have available. The algorithm may only slowly converge towards a minimum for larger datasets.

  3. Difficult to Escape Local Minima:
    For non-convex loss functions, many local minima will be present in weight space. Batch GD tends to lead towards a direct path to the nearest minimum. As such, this algorithm can get stuck at a local minimum in these situations.

  4. Not Suitable for Online or Real-Time Learning:
    Batch Gradient Descent requires the entire dataset for each update, which makes it unsuitable for online learning (where data is received continuously over time).
    In applications like recommendation systems, stock market prediction, or fraud detection, where data arrives sequentially, Gradient Descent would need to retrain the model repeatedly on the entire dataset, which is inefficient.

2. Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of the classic Gradient Descent (GD) used to train machine learning models, including neural networks. While Batch Gradient Descent computes the gradient using the entire dataset, SGD computes the gradient based on a single randomly chosen data point.

Consider this Pseudo code:

Set learning rate η
Set number of epochs
loss_at_each_epoch = []

for epoch in range(epochs):
    # Shuffle the data

    epoch_loss = 0

    for j in range(X.shape[0]):
        # Step 1: Predict y_hat for the single data point x[j]
        y_hat[j] = weight_transpose * x[j] + bias

        # Step 2: Calculate loss (e.g., squared error)
        loss = (y_hat[j] - y[j]) ** 2
        epoch_loss = epoch_loss + loss

        # Step 3: Calculate gradients
        gradient_W = dL/dW
        gradient_b = dL/db

        # Step 4: Update weights and bias
        W = W - η * gradient_W
        b = b - η * gradient_b

    # Calculate average loss for the epoch
    average_loss = epoch_loss / X.shape[0]
    loss_at_each_epoch.append(average_loss)

    print("Epoch:", epoch, "Loss:", average_loss)

Return final weights W

Explanation

1. Initialize Parameters: The model weights and bias are initialized randomly.
2. Learning rate and Epochs: Set the learning rate \(\eta\) and the number of epochs.
3. Loop through epochs: Repeat for each epoch until the specified number of epochs.
    3.1 Shuffling: Randomly shuffle the dataset at the beginning of each epoch to ensure training order doesn’t impact the model.
    3.2 Loop through each training sample:
       3.2.1 Calculate prediction: Compute the prediction \(\hat{y_j}\) for each row selected at random using \[\hat{y_j} = W^T \cdot x_j + b\]
       3.2.2 Compute Loss: Compute the loss \((y_j - \hat{y_j})^2\) for each prediction.
       3.2.3 Calculate Gradient \(\boldsymbol{\nabla_WJ(W)}\): Compute the gradient \(\frac{\partial J(W)}{\partial w}\) of the loss function with respect to the model's parameters (weights and bias) for that single sample.
       3.2.4 Update Weights: Update the weights \(W\) by using the gradient:\[{W = W - \eta \cdot \nabla_{W} J(W)}\]

4. Track the Loss: Track the loss for each sample and compute the average loss at the end of the epoch. This provides insight into how the model is improving.
Repeat step 3 for a specified number of epochs or until convergence (defined as when the average loss no longer decreases significantly over consecutive epochs).

Number of times weight update = (number of epochs) x (number of training samples)

Stochastic gradient descent near a piecewise linear saddle point.

Contour plot of Convex function with SGD reaching on Global minimum with much noise
Code
import numpy as np
import plotly.graph_objects as go
# Generate a simple dataset for a linear regression task with 2D weights
np.random.seed(42)
X_data = np.linspace(-2, 2, 100).reshape(-1, 1)
X_data = np.hstack((X_data, np.ones_like(X_data)))  # Adding a bias column
y_data = 3 * X_data[:, 0] + 2 + np.random.normal(0, 0.5, X_data.shape[0])  # Linear relationship with noise

# Define the loss function (Mean Squared Error) and gradient
def loss_function(W, X, y):
    # Add a sinusoidal term to make the function non-convex
    preds = W[0] * X[:, 0] + W[1] * X[:, 1]
    # The sinusoidal term introduces local minima
    return np.mean((preds - y)**2) + 5 * np.sin(W[0]) * np.cos(W[1])

def gradient(W, x, y):
    # Calculate the predictions and error
    preds = W[0] * x[:, 0] + W[1] * x[:, 1]
    error = preds - y
    
    # Compute the gradient with an additional sinusoidal component
    grad_w0 = np.mean(2 * error * x[:, 0]) + 5 * np.cos(W[0]) * np.cos(W[1])
    grad_w1 = np.mean(2 * error * x[:, 1]) - 5 * np.sin(W[0]) * np.sin(W[1])
    
    return np.array([grad_w0, grad_w1])

# Initialize parameters for SGD
learning_rate = 0.1
epochs = 1
W = np.array([-2.0, -2])  # Starting with higher values for weights

# Lists to store weight and loss values for plotting
W_vals, loss_vals = [W.copy()], [loss_function(W, X_data, y_data)]

# Perform SGD
for epoch in range(epochs):
    # Shuffle data indices
    indices = np.random.permutation(len(X_data))
    X_data_shuffled = X_data[indices]
    y_data_shuffled = y_data[indices]
    
    # Update weights for each sample in the dataset
    for i in range(len(X_data_shuffled)):
        x_i = X_data_shuffled[i].reshape(1, -1)
        y_i = y_data_shuffled[i]
        
        # Compute gradient using single data point
        grad = gradient(W, x_i, np.array([y_i]))
        
        # Update weights
        W -= learning_rate * grad
        
        # Store values for 3D plot
        W_vals.append(W.copy())
        loss_vals.append(loss_function(W, X_data, y_data))

# Convert weight values into 3D coordinates for plotting
W_vals = np.array(W_vals)
loss_vals = np.array(loss_vals)
x_path, y_path = W_vals[:, 0], W_vals[:, 1]
z_path = loss_vals

# Create a meshgrid for the 3D surface of the loss landscape
x_range = np.linspace(-3, 6, 50)
y_range = np.linspace(-3, 6, 50)
X, Y = np.meshgrid(x_range, y_range)
Z = np.array([loss_function([x, y], X_data, y_data) for x, y in zip(np.ravel(X), np.ravel(Y))])
Z = Z.reshape(X.shape)

# Plotting with Plotly
fig = go.Figure()

# Add the 3D surface of the loss landscape (non-convex due to the sinusoidal term)
fig.add_trace(go.Surface(z=Z, x=X, y=Y, colorscale="Viridis", opacity=0.8))

# Add the noisy SGD path as a scatter plot
fig.add_trace(go.Scatter3d(
    x=x_path, y=y_path, z=z_path,
    mode='markers+lines',
    marker=dict(size=5, color='red'),
    line=dict(color='red', width=3),
    name="Noisy SGD Path"
))

# Labels and layout adjustments
fig.update_layout(
    title="Stochastic Gradient Descent (SGD) Path on Non-Convex Loss Function",
    scene=dict(
        xaxis_title='Weight 1 (W[0])',
        yaxis_title='Weight 2 (W[1])',
        zaxis_title='Loss (MSE)',
    )
)

fig.show()
Advantages
  1. Faster Convergence (Per Iteration):
    Stochastic Gradient Descent (SGD) is computationally faster than Batch Gradient Descent (BGD) because it updates parameters based on a single data point rather than the entire dataset. This makes each step of SGD quicker to compute, especially on large datasets.

  2. Escapes Local Minima:
    The noisy nature of SGD (due to random updates) helps it potentially escape local minima or saddle points and can help find a better global minimum.

  3. Online Learning:
    SGD can be used for online learning, where the model continuously learns as new data becomes available. It’s well-suited for situations where the data cannot be stored in memory all at once.

  4. Memory Efficiency:
    Only one (or a few) data points are needed at each iteration, so SGD is more memory efficient compared to batch GD, which requires storing the entire dataset in memory.

  5. Better Generalization:
    The stochasticity can act as a form of regularization, making the model generalize better on unseen data, as it prevents overfitting to the training set.
    In SGD, randomness introduces noise into each update step. This randomness helps the algorithm explore the loss surface but can prevent exact convergence to the minimum, resulting in an approximate solution that oscillates around a good region rather than settling precisely at the minimum.
    Exact convergence is often unnecessary. Slight variations in weights may not significantly impact model performance. Thus, practitioners often accept approximate solutions that offer satisfactory generalization on unseen data, rather than seeking an exact solution that may overfit the training data.
    Exact optimization is often computationally expensive and, for large datasets and models, may be infeasible within reasonable time limits. Gradient descent methods are designed for efficiency, trading off exactness for speed and scalability.
Disadvantages
  1. Hyperparameter Sensitivity:
    The learning rate in SGD is crucial. A high learning rate might cause overshooting, while a low learning rate might result in slow convergence. Finding the optimal learning rate requires experimentation.
  2. Vanishing or Exploding Gradients (in Deep Networks):
    In deep neural networks, the gradients can become very small (vanishing) or very large (exploding) during backpropagation, which can hinder the training process.
  3. Slower Overall Convergence:
    Total convergence time is often longer with SGD due to the high variance in its updates. Since SGD doesn't compute the gradient over the whole dataset at each step, each individual update is noisy, causing the optimizer to take a more erratic path toward the minimum. This high variance can mean more steps are needed overall, potentially resulting in more time to converge to a stable region around the minimum.
    To reduce this issue, techniques like learning rate decay and momentum can help smooth the path to convergence, but they still don't eliminate the noisy nature of SGD entirely.
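The momentum idea mentioned above can be sketched in a few lines; the quadratic objective, the noise model standing in for SGD's per-sample variance, and the coefficients are illustrative assumptions:

```python
import numpy as np

def sgd_momentum_step(W, grad, velocity, eta=0.01, beta=0.9):
    """Momentum update: velocity accumulates an exponential average of past gradients."""
    velocity = beta * velocity - eta * grad
    return W + velocity, velocity

# Minimizing f(W) = W^2 with noisy gradients (noise simulates SGD's sample-to-sample variance)
rng = np.random.default_rng(0)
W, v = 5.0, 0.0
for _ in range(500):
    noisy_grad = 2 * W + rng.normal(0, 0.5)
    W, v = sgd_momentum_step(W, noisy_grad, v)

print(round(W, 2))  # settles near the minimum at 0 despite the noisy gradients
```

Because the velocity averages many noisy gradients, individual noise spikes largely cancel out, smoothing the path toward the minimum.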

Considering a code example

Code
import numpy as np
import pandas as pd
import time
df = pd.read_csv("https://raw.githubusercontent.com/shivang98/Social-Network-ads-Boost/refs/heads/master/Social_Network_Ads.csv")
print("Using a Social_network_data with Shape: ", df.shape)
df.head()
Using a Social_network_data with Shape:  (400, 5)
User ID Gender Age EstimatedSalary Purchased
0 15624510 Male 19 19000 0
1 15810944 Male 35 20000 0
2 15668575 Female 26 43000 0
3 15603246 Female 27 57000 0
4 15804002 Male 19 76000 0
## Considering only [Age, EstimatedSalary] as features 
## and [Purchased] as target.

df = df[['Age','EstimatedSalary','Purchased']]
x = df.iloc[:,0:2]
y = df.iloc[:,-1]
print("Input features.. ")
x.head()
Input features.. 
Age EstimatedSalary
0 19 19000
1 35 20000
2 26 43000
3 27 57000
4 19 76000
## Scaling the features before feeding into neural network,

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
## Performing train-test split

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_scaled,y,test_size=0.2,random_state=2)
Code
print("Training Data shape ", x_train.shape)
Training Data shape  (320, 2)
## Neural network architecture

import tensorflow as tf
from tensorflow import keras
from keras import Sequential
from keras.layers import Dense
import time

model = Sequential()
model.add(Dense(10,activation='relu',input_dim=2))
model.add(Dense(10,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_3 (Dense)             (None, 10)                30        
                                                                 
 dense_4 (Dense)             (None, 10)                110       
                                                                 
 dense_5 (Dense)             (None, 1)                 11        
                                                                 
=================================================================
Total params: 151 (604.00 Byte)
Trainable params: 151 (604.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
## Training the network using Batch GD
## for Batch GD, we set batch_size = x_train.shape[0]
## so that all data is used per epoch to calculate loss

model.compile(loss='binary_crossentropy', metrics=['accuracy'])
start = time.time()
history = model.fit(x_train,y_train, epochs=10, batch_size=320)
print("\nTime taken in Batch GD: ", time.time() - start)
Epoch 1/10
1/1 [==============================] - 0s 443ms/step - loss: 0.6964 - accuracy: 0.4875
Epoch 2/10
1/1 [==============================] - 0s 5ms/step - loss: 0.6819 - accuracy: 0.5312
Epoch 3/10
1/1 [==============================] - 0s 6ms/step - loss: 0.6719 - accuracy: 0.5875
Epoch 4/10
1/1 [==============================] - 0s 6ms/step - loss: 0.6637 - accuracy: 0.6187
Epoch 5/10
1/1 [==============================] - 0s 4ms/step - loss: 0.6566 - accuracy: 0.6469
Epoch 6/10
1/1 [==============================] - 0s 6ms/step - loss: 0.6503 - accuracy: 0.6562
Epoch 7/10
1/1 [==============================] - 0s 4ms/step - loss: 0.6445 - accuracy: 0.6562
Epoch 8/10
1/1 [==============================] - 0s 7ms/step - loss: 0.6392 - accuracy: 0.6719
Epoch 9/10
1/1 [==============================] - 0s 4ms/step - loss: 0.6341 - accuracy: 0.6812
Epoch 10/10
1/1 [==============================] - 0s 7ms/step - loss: 0.6293 - accuracy: 0.6906

Time taken in Batch GD:  0.5984237194061279
## Training the network using Stochastic GD
## for Stochastic GD, we set batch_size = 1
## so that the weights are updated after each sample

model2 = Sequential()
model2.add(Dense(10,activation='relu',input_dim=2))
model2.add(Dense(10,activation='relu'))
model2.add(Dense(1,activation='sigmoid'))

model2.compile(loss='binary_crossentropy', metrics=['accuracy'])
start = time.time()
history2 = model2.fit(x_train,y_train, epochs=10, batch_size=1)
print("\nTime taken in SGD: ", time.time() - start)
Epoch 1/10
320/320 [==============================] - 1s 1ms/step - loss: 0.5029 - accuracy: 0.7781
Epoch 2/10
320/320 [==============================] - 1s 2ms/step - loss: 0.3747 - accuracy: 0.8094
Epoch 3/10
320/320 [==============================] - 0s 2ms/step - loss: 0.3233 - accuracy: 0.8469
Epoch 4/10
320/320 [==============================] - 0s 1ms/step - loss: 0.2998 - accuracy: 0.8500
Epoch 5/10
320/320 [==============================] - 0s 1ms/step - loss: 0.2863 - accuracy: 0.8656
Epoch 6/10
320/320 [==============================] - 0s 1ms/step - loss: 0.2743 - accuracy: 0.8781
Epoch 7/10
320/320 [==============================] - 0s 1ms/step - loss: 0.2701 - accuracy: 0.8875
Epoch 8/10
320/320 [==============================] - 0s 1ms/step - loss: 0.2646 - accuracy: 0.8906
Epoch 9/10
320/320 [==============================] - 0s 1ms/step - loss: 0.2610 - accuracy: 0.9000
Epoch 10/10
320/320 [==============================] - 0s 1ms/step - loss: 0.2550 - accuracy: 0.9031

Time taken in SGD:  4.244263172149658

Comparing the time and accuracy of Batch GD and SGD, we can say:

  1. Batch GD is faster per epoch because the weights are updated only once per epoch. But its accuracy of 69% indicates that more epochs are needed.

  2. SGD is slower per epoch because the weights are updated 320 times per epoch. But its accuracy of about 90% indicates that convergence has (nearly) been achieved.

# plotting Loss graph for SGD with 500 epochs

model2 = Sequential()
model2.add(Dense(10,activation='relu',input_dim=2))
model2.add(Dense(10,activation='relu'))
model2.add(Dense(1,activation='sigmoid'))

model2.compile(loss='binary_crossentropy', metrics=['accuracy'])
history2 = model2.fit(x_scaled, y, epochs=500, batch_size=1, validation_split=0.2)
Code
import matplotlib.pyplot as plt
plt.plot(history2.history['loss'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('SGD loss @ 500 epochs',  color='brown')
plt.show()

# plotting Loss graph for Batch GD with 500 epochs

model = Sequential()
model.add(Dense(10,activation='relu',input_dim=2))
model.add(Dense(10,activation='relu'))
model.add(Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(x_train,y_train, epochs=500, batch_size=320)
Code
plt.plot(history.history['loss'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Batch GD loss @ 500 epochs', color='brown')
plt.show()

Batch Gradient Descent produces a smooth, stable loss curve, while Stochastic Gradient Descent produces a noisy, unstable one because the weights are updated on individual data points.

Stochastic GD vs Batch GD

| Feature | Stochastic GD | Batch GD |
|---|---|---|
| Update Frequency | Updates weights after each training example or mini-batch | Updates weights after the entire training set |
| Computational Complexity | Low per iteration, but requires more iterations (epochs) | High per iteration (processing the entire dataset) |
| Convergence | Noisy; can escape local minima, but may oscillate around the minimum | Smooth convergence, but may get stuck in local minima |
| Memory Usage | Low (only one data point or mini-batch) | High (needs the entire dataset in memory) |
| Ideal Use Case | Large datasets, online learning, deep learning (non-convex) | Small to medium-sized datasets, convex problems |
| Generalization | Better generalization due to noise and randomness, but yields an approximate solution, not an exact one | May overfit if not regularized properly, but yields an exact solution |
| Convergence Speed | Slower overall (needs more epochs) but faster per iteration | Fewer, less noisy steps, but each full-batch iteration is slow |

Why randomness and noise in SGD?

The noise and instability in Stochastic Gradient Descent (SGD) primarily arise from updating the model parameters based on individual data points instead of the entire dataset or a large batch.

Since each sample carries unique characteristics (and sometimes outliers or noise), the gradient calculated from a single data point may not accurately represent the direction to the true minimum.

This results in each update step being influenced by the particularities of the sample, leading to high variance in the updates and a more erratic, zigzagging path toward convergence.
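This variance is easy to observe numerically. The sketch below (with made-up data and a deliberately wrong current parameter) compares the spread of single-sample gradients against the full-batch gradient for a simple squared-error loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000,))
y = 3.0 * X + rng.normal(scale=0.5, size=1000)
w = 0.0  # a deliberately wrong current parameter

# Per-sample gradients of (x_i * w - y_i)^2 with respect to w
per_sample = 2 * X * (X * w - y)
full_batch = per_sample.mean()  # the full-batch gradient is their average

print("full-batch gradient:", round(full_batch, 2))
print("per-sample std dev: ", round(per_sample.std(), 2))
# Each single-sample gradient is an unbiased estimate of the full-batch one,
# but its spread is of the same order as the signal itself.
```

Each SGD step follows one of these scattered per-sample gradients, which is precisely the high-variance, zigzagging behavior described above.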

Why Doesn’t SGD Cause Overfitting?

Overfitting happens when your model fits too closely to the training data, learning all the small noise and idiosyncrasies. This makes it less able to generalize to new data. SGD mitigates this in several ways:

  • Noisy Updates Reduce Memorization: SGD uses a single sample per update. This randomness introduces noise, which disrupts the model’s ability to fit too precisely to the training data. In doing so, SGD prevents the model from “memorizing” specific data points, encouraging it to generalize better to unseen data.

  • Avoidance of Sharp Minima: The noise in SGD often prevents it from settling into sharp, narrow minima in the loss landscape, which can indicate overfitting. Sharp minima are highly tuned to training data specifics and usually perform poorly on test data. Instead, the noisy path of SGD encourages it to find flatter minima, which tend to represent solutions that generalize better across different data.

  • Regularization Effect Without Additional Terms: By not fully converging on a precise minimum, SGD implicitly regularizes the model, helping it to avoid overfitting without the need for additional penalty terms in the loss function.

In practice, this makes SGD especially effective in deep learning and large datasets, where generalization is key.

What is Vectorization?

Vectorization in the context of machine learning and optimization refers to the process of rewriting code to perform mathematical operations on entire arrays (or vectors) at once, rather than using explicit loops to handle individual elements. It takes advantage of optimized low-level libraries (such as NumPy, TensorFlow, or PyTorch) that allow operations on entire datasets in a parallelized and efficient manner.

In Gradient Descent, vectorization is used to apply the gradient update rules to multiple data points simultaneously. Instead of computing the output for each data point one by one (as in SGD), vectorized operations enable computing the output for all data elements or mini-batch elements in parallel, which accelerates the training process.

For example:

  • In non-vectorized code, you’d loop through each data point, compute the output, and update the weights sequentially.

    for i in range(X.shape[0]):
        y_hat[i] = np.dot(X[i], weights) + bias

    or

    \(\hat{y}_i = X_i \cdot \text{weights} + \text{bias}, \quad i = 1, \dots, n\)

  • In vectorized code, the outputs (and gradients) for an entire mini-batch can be computed in one step using matrix operations, allowing for faster calculations, especially when using hardware acceleration (like GPUs).

    predictions = np.dot(X, weights) + bias

    or

    \(\hat{y} = X \cdot \text{weights} + \text{bias}\)

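The speed difference is easy to measure. A small benchmark with illustrative array sizes (absolute timings vary by machine, but the ordering does not):

```python
import numpy as np
import time

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
weights = rng.normal(size=10)
bias = 0.5

# Non-vectorized: explicit Python loop over rows
start = time.time()
y_loop = np.empty(X.shape[0])
for i in range(X.shape[0]):
    y_loop[i] = np.dot(X[i], weights) + bias
loop_time = time.time() - start

# Vectorized: one matrix-vector product over all rows at once
start = time.time()
y_vec = X @ weights + bias
vec_time = time.time() - start

print("results match:", np.allclose(y_loop, y_vec))
print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.4f}s")
```

Both versions compute identical predictions; the vectorized one simply hands the whole computation to NumPy's optimized BLAS routines instead of paying Python's per-iteration interpreter overhead.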
3. Mini-Batch Gradient Descent

Mini-Batch Gradient Descent (MBGD) is a variant of gradient descent that splits the dataset into smaller “mini-batches,” which are subsets of data typically ranging from 32 to 256 samples each. For each mini-batch, MBGD computes the gradient and updates the model weights, combining aspects of both Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD).
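The mini-batch update loop can be sketched in NumPy for a simple linear model. The data, shapes, and hyperparameters below are illustrative, not taken from the Keras example elsewhere in this section:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=400)

w, b = np.zeros(2), 0.0
eta, batch_size = 0.05, 32

for epoch in range(100):
    idx = rng.permutation(len(y))  # reshuffle the data each epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                     # residuals on this mini-batch
        w -= eta * 2 * (Xb.T @ err) / len(batch)  # gradient averaged over the batch
        b -= eta * 2 * err.mean()

print(np.round(w, 2))  # close to the true weights [1.0, -1.0]
```

Averaging the gradient over 32 samples per update is what gives MBGD its lower variance than SGD, while the per-epoch reshuffle preserves some of the randomness that helps exploration.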

Advantages
  1. Balanced Convergence and Efficiency:
    Mini-batches allow for a compromise between BGD and SGD, offering both the smoother convergence of BGD and the computational efficiency of SGD. By updating weights more frequently than BGD but less noisily than SGD, MBGD strikes a balance in speed and stability.
    With each mini-batch update, the gradient is averaged over multiple samples, reducing the variance in weight updates while retaining some of the randomization that helps in exploring the loss surface effectively.

  2. Faster Computation with Parallel Processing:
    Mini-batch updates are faster than BGD since fewer data points are processed in each iteration. They also enable parallel processing on GPUs using vectorization, further speeding up the training process on large datasets.

  3. Reduced Memory Requirements:
    By dividing the dataset into smaller subsets, MBGD can handle larger datasets than BGD, as it doesn’t require loading the entire dataset into memory at once.

  4. Improved Generalization:
    The small level of noise introduced by mini-batches (compared to BGD) prevents overfitting to a degree, helping the model generalize better on unseen data. This noise is enough to keep the model from settling into sharp, narrow minima, which are often associated with overfitting.

Limitations
  1. Hyperparameter Tuning:
    MBGD involves tuning several parameters, including mini-batch size and learning rate. The ideal mini-batch size can vary depending on the model and dataset, and the choice of size impacts convergence speed and stability. This added complexity can make MBGD harder to optimize.

  2. Not Suitable for Online or Real-Time Learning:
    Mini-batch Gradient Descent (MBGD) is generally not used in online learning because it still relies on processing batches of data, whereas online learning typically refers to algorithms that update model parameters after processing a single data point (like Stochastic Gradient Descent, SGD). Online learning involves sequential updates that can adapt to data arriving in real-time, such as when data is streaming.
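True online learning, by contrast, updates the model after every incoming example. scikit-learn exposes this pattern through `partial_fit`; the sketch below streams synthetic examples one at a time (the data and the linearly separable labeling rule are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)  # a linear classifier trained with SGD

# Simulate a stream: each example arrives once and updates the model immediately.
for _ in range(2000):
    x = rng.normal(size=(1, 2))
    label = np.array([int(x[0, 0] + x[0, 1] > 0)])
    clf.partial_fit(x, label, classes=[0, 1])

# Evaluate on a fresh batch drawn from the same distribution
X_test = rng.normal(size=(500, 2))
y_test = (X_test.sum(axis=1) > 0).astype(int)
print("held-out accuracy:", clf.score(X_test, y_test))
```

Because `partial_fit` performs one SGD-style update per call, the model adapts continuously as data arrives, which is exactly the setting where per-sample updates beat mini-batches.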
| Feature | Mini-Batch Gradient Descent (MBGD) | Stochastic Gradient Descent (SGD) |
|---|---|---|
| Update Frequency | After each mini-batch | After each individual data point |
| Computational Efficiency | High with parallel processing | Efficient per iteration but slower overall for large datasets |
| Convergence | Smoother, more stable | Noisy, high variance |
| Memory Usage | Lower than Batch Gradient Descent | Low |
| Ideal Use Case | Large datasets, balanced optimization | Real-time updates, online learning |
| Regularization Effect | Implicit regularization with reduced noise | Implicit regularization but noisier convergence |
Code
import numpy as np
import pandas as pd
import time
df = pd.read_csv("https://raw.githubusercontent.com/shivang98/Social-Network-ads-Boost/refs/heads/master/Social_Network_Ads.csv")
print("Using a Social_network_data with Shape: ", df.shape)

df = df[['Age','EstimatedSalary','Purchased']]
x = df.iloc[:,0:2]
y = df.iloc[:,-1]

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
Using a Social_network_data with Shape:  (400, 5)
# training network using mini batch

model3 = Sequential()
model3.add(Dense(10,activation='relu',input_dim=2))
model3.add(Dense(10,activation='relu'))
model3.add(Dense(1,activation='sigmoid'))

model3.compile(loss='binary_crossentropy', metrics=['accuracy'])
history3 = model3.fit(x_scaled,y, epochs=500, batch_size=32, validation_split=0.2)
Code
plt.plot(history3.history['loss'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Mini-Batch GD loss @ 500 epochs', color='brown')
plt.show()

Batch GD vs SGD vs Mini-Batch GD